Abstract

In this paper, we study the MapReduce framework from an algorithmic standpoint and demonstrate
the usefulness of our approach by designing and analyzing efficient MapReduce algorithms for
fundamental sorting, searching, and simulation problems. This study is motivated by a goal of ultimately
putting the MapReduce framework on an equal theoretical footing with the well-known PRAM and
BSP parallel models, which would benefit both the theory and practice of MapReduce algorithms. We
describe efficient MapReduce algorithms for sorting, multi-searching, and simulations of parallel
algorithms specified in the BSP and CRCW PRAM models. We also provide some applications of these
results to problems in parallel computational geometry for the MapReduce framework, which result in
efficient MapReduce algorithms for sorting, 2- and 3-dimensional convex hulls, and fixed-dimensional
linear programming. For the case when mappers and reducers have a memory/message-I/O size of
$M = \Theta(N^{\varepsilon})$, for a small constant $\varepsilon > 0$, all of our MapReduce algorithms for these
applications run in a constant number of rounds.



1 Introduction 



The MapReduce framework [5, 6] is a programming paradigm for designing parallel and distributed
algorithms. It provides a simple programming interface that is specifically designed to make it easy for a
programmer to design a parallel program that can efficiently perform a data-intensive computation.
Moreover, it is a framework that allows for parallel programs to be directly translated into computations for cloud
computing environments and server clusters (e.g., see [16]). This framework is gaining widespread interest
in systems domains, in that this framework is being used in Google data centers and as a part of the
open-source Hadoop system [20] for server clusters, which have been deployed by a wide variety of enterprises,¹
including Yahoo!, IBM, The New York Times, eHarmony, Facebook, and Twitter.

Building on pioneering work by Feldman et al. [9] and Karloff et al. [14], our interest in this paper is in
studying the MapReduce framework from an algorithmic standpoint, by designing and analyzing
MapReduce algorithms for fundamental sorting, searching, and simulation problems. Such a study could be a
step on the way to ultimately putting the MapReduce framework on an equal theoretical footing with the
well-known PRAM and BSP parallel models.

Still, we would be remiss if we did not mention that this framework is not without its detractors. DeWitt 
and Stonebraker [7] mention several issues they feel are shortcomings of the MapReduce framework,
including that it seems to require brute-force enumeration instead of indexing for performing searches. Naturally,
we feel that this criticism is a bit harsh, as the theoretical limits of the MapReduce framework have yet to be 
fully explored; hence, we feel that further theoretical study is warranted. Indeed, this paper can be viewed 
as at least a partial refutation of the claim that the MapReduce framework disallows indexed searching, in 
that we show how to perform fast and efficient multi-search in the MapReduce framework. 



1.1 The MapReduce Framework 

In the MapReduce framework, a computation is specified as a sequence of map, shuffle, and reduce steps
that operate on a set $X = \{x_1, x_2, \ldots, x_n\}$ of values:

• A map step applies a function, $\mu$, to each value, $x_i$, to produce a finite set of key-value pairs $(k, v)$.
To allow for parallel execution, the computation of the function $\mu(x_i)$ must depend only on $x_i$.

• A shuffle step collects all the key-value pairs produced in the previous map step, and produces a set
of lists, $L_k = (k; v_1, v_2, \ldots)$, where each such list consists of all the values, $v_j$, such that $k_j = k$ for a
key k assigned in the map step.

• A reduce step applies a function, $\rho$, to each list $L_k = (k; v_1, v_2, \ldots)$, formed in the shuffle step, to
produce a set of values, $y_1, y_2, \ldots$. The reduction function, $\rho$, is allowed to be defined sequentially
on $L_k$, but should be independent of other lists $L_{k'}$ where $k' \ne k$.

The parallelism of the MapReduce framework comes from the fact that each map or reduce operation
can be executed on a separate processor independently of others. Thus, the user simply defines the
functions $\mu$ and $\rho$, and the system automatically schedules map-shuffle-reduce steps and routes data to available
processors, including provisions for fault tolerance.

The outputs from a reduce step can, in general, be used as inputs to another round of map-shuffle-reduce 
steps. Thus, a typical MapReduce computation is described as a sequence of map-shuffle-reduce steps that 
perform a desired action in a series of rounds that produce the algorithm's output after the last reduce step. 
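
As a minimal sketch of the round structure described above, the following Python code runs map, shuffle, and reduce sequentially on a single machine. The names `map_reduce_round` and `run_rounds`, and the parity example, are hypothetical and serve only as illustration; a real MapReduce system would distribute the map and reduce calls across processors.

```python
from collections import defaultdict


def map_reduce_round(values, mu, rho):
    """Run one map-shuffle-reduce round sequentially (illustration only)."""
    # Map: mu sees one value at a time and emits (key, value) pairs.
    pairs = [pair for x in values for pair in mu(x)]
    # Shuffle: group the emitted values by key into lists L_k.
    lists = defaultdict(list)
    for key, value in pairs:
        lists[key].append(value)
    # Reduce: rho processes each list independently of the other lists.
    return [y for key, vs in lists.items() for y in rho(key, vs)]


def run_rounds(values, rounds):
    """Chain rounds: the outputs of one reduce step feed the next map step."""
    for mu, rho in rounds:
        values = map_reduce_round(values, mu, rho)
    return values


# Hypothetical example: sum of squares grouped by parity.
squares_by_parity = map_reduce_round(
    [1, 2, 3, 4],
    mu=lambda x: [(x % 2, x * x)],
    rho=lambda key, vs: [(key, sum(vs))],
)
```

Because `mu` receives a single value and `rho` a single list, neither function can observe data belonging to another mapper or reducer, which is the independence requirement stated in the definitions.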



¹ See http://en.wikipedia.org/wiki/Hadoop.

1.2 Evaluating MapReduce Algorithms

Ideally, we desire the number of rounds in a MapReduce algorithm to be a constant. For example, consider 
an often-cited MapReduce algorithm to count all the instances of words in a document. Given a document, 
D, we define the set of input values X to be all the words in the document and we then proceed as follows: 

1. Map: For each word, w, in the document, map w to (w, 1).

2. Shuffle: collect all the (w, 1) pairs for each word, producing a list (w; 1, 1, ..., 1), noting that the
number of 1's in each such list is equal to the number of times w appears in the document.

3. Reduce: scan each list (w; 1, 1, ..., 1), summing up the number of 1's in each such list, and output a
pair $(w, n_w)$ as a final output value, where $n_w$ is the number of 1's in the list for w.

This single-round computation clearly computes the number of times each word appears in D. 
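
The three steps can be sketched in a few lines of Python. This is a sequential illustration under the assumption that words are separated by whitespace; the function name `word_count` is ours.

```python
from collections import defaultdict


def word_count(document):
    """Count word occurrences with explicit map, shuffle, and reduce steps."""
    # Map: each word w becomes the pair (w, 1).
    pairs = [(w, 1) for w in document.split()]
    # Shuffle: collect the 1's for each word into a list (w; 1, 1, ..., 1).
    lists = defaultdict(list)
    for w, one in pairs:
        lists[w].append(one)
    # Reduce: sum the 1's in each list to obtain the pair (w, n_w).
    return {w: sum(ones) for w, ones in lists.items()}


counts = word_count("the cat and the hat")
```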

The number of rounds in a MapReduce algorithm is not always equal to 1, however, and there are, in 
fact, several metrics that one can use to measure the efficiency of a MapReduce algorithm over the course 
of its execution, including the following: 

• We can consider R, the number of rounds of map-shuffle-reduce that the algorithm uses. 

• If we let $n_{r,1}, n_{r,2}, \ldots$ denote the mapper and reducer I/O sizes for round r, so that $n_{r,i}$ is the size of
the inputs and outputs for mapper/reducer i in round r, then we can define $C_r$, the communication
complexity of round r, to be the total size of the inputs and outputs for all the mappers and reducers
in round r, that is, $C_r = \sum_i n_{r,i}$. We can also define the communication complexity, $C = \sum_{r=0}^{R-1} C_r$,
for the entire algorithm.

• We can let $t_r$ denote the internal running time for round r, which is the maximum internal running
time taken by a mapper or reducer in round r, where we assume $t_r \ge \max_i \{n_{r,i}\}$, since a mapper or
reducer must have a running time that is at least the size of its inputs and outputs. We can also define
total internal running time, $t = \sum_{r=0}^{R-1} t_r$, for the entire algorithm, as well.

We can make a crude calibration of a MapReduce algorithm using the following additional parameters: 

• L: the latency L of the shuffle network, which is the number of steps that a mapper or reducer has to 
wait until it receives its first input in a given round. 

• B: the bandwidth of the shuffle network, which is the number of elements in a MapReduce computa- 
tion that can be delivered by the shuffle network in any time unit. 

Given these parameters, a lower bound for the total running time, T, of an implementation of a
MapReduce algorithm can be characterized as follows:

$$T = \Omega\left(\sum_{r=0}^{R-1} \left(t_r + L + C_r/B\right)\right) = \Omega(t + RL + C/B).$$

For example, given a document D of n words, the simple word-counting MapReduce algorithm given above
has a worst-case performance of $R = 1$, $C = \Theta(n)$, and $t = \Theta(n)$; hence, its worst-case time performance
is $T = \Theta(n)$, which is no faster than sequential computation. Unfortunately, such performance could be quite
common, as the frequency of words in a natural-language document tends to follow Zipf's law, so that some
words appear quite frequently, and the running time of the simple word-counting algorithm is proportional
to the number of occurrences of the most-frequent word. For instance, in the Brown Corpus [15], the word
"the" accounts for 7% of all word occurrences.²

Note, therefore, that focusing exclusively on R, the number of rounds in a MapReduce algorithm, can 
actually lead to an inefficient algorithm. For example, if we focus only on the number of rounds, R, then 
the most efficient algorithm would always be the trivial one-round algorithm, which maps all the inputs 
to a single key and then has the reducer for this key perform a standard sequential algorithm to solve the 
problem. This approach would run in one round, but it would not use any parallelism; hence, it would be 
relatively slow compared to an algorithm that was more "parallel." 

1.3 Memory-Bound and I/O-Bound MapReduce Algorithms 

So as to steer algorithm designers away from the trivial one-round algorithm, recent algorithmic
formalizations of the MapReduce paradigm have focused primarily on optimizing the round complexity bound, R,
while restricting the memory size or input/output size for reducers. Karloff et al. [14] define their
MapReduce model, MRC, so that each reducer's I/O size is restricted to be $O(n^{1-\varepsilon})$ for some small constant $\varepsilon > 0$,
and Feldman et al. [9] define their model, MUD, so that reducer memory size is restricted to be $O(\log^c n)$,
for some constant $c > 0$, and reducers are further required to process their inputs in a single pass. These
restrictions limit the feasibility of the trivial one-round algorithm for solving a problem in the MapReduce
framework and instead compel algorithm designers to make better utilization of parallelism.

In this paper, we follow the I/O-bound approach, as it seems to correspond better to the way reducer
computations are specified, but we take a somewhat more general characterization than Karloff et al. [14],
in that we do not bound the I/O size for reducers explicitly to be $O(n^{1-\varepsilon})$, but instead allow it to be an
arbitrary parameter:

• We define M to be an upper bound on the I/O-buffer memory size for all reducers used in a given
MapReduce algorithm. That is, we predefine M to be a parameter and require that $\forall r, i : n_{r,i} \le M$.

We then can use M in the design and/or analysis of each of our MapReduce algorithms. For instance, if
each round of an algorithm has reducers with an I/O size of at most M, then we say that this algorithm
is an I/O-memory-bound MapReduce algorithm with parameter M. In addition, if each round has a reducer
with an I/O size proportional to M (whose processing probably dominates the reducer's internal running
time), then we can give a simplified lower bound on the time, T, for such an algorithm as

$$T = \Omega(R(M + L) + C/B).$$

This approach therefore can characterize the limits of parallelism that are possible in a MapReduce
algorithm and it also shows that we should concentrate on the round complexity and communication complexity
of a MapReduce algorithm in characterizing its performance.³ Of course, such bounds for R and C may
depend on M, but that is fine, for similar characterizations are common in the literature on external-memory
algorithms (e.g., see [1, 3, 4, 18, 19]). In the rest of the paper, when we talk about the MapReduce model,
we always mean the I/O-memory-bound MapReduce model except when mentioned explicitly.
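
To make the trade-off between rounds and reducer size concrete, the sketch below evaluates the expression R(M + L) + C/B that appears inside the simplified bound, ignoring the hidden constant. All parameter values (input size, latency, bandwidth, round counts) are hypothetical and chosen only to illustrate the arithmetic.

```python
def simplified_time_bound(R, M, L, C, B):
    """Evaluate R*(M + L) + C/B, the expression inside the Omega bound."""
    return R * (M + L) + C / B


N = 10**6                         # hypothetical input size
latency, bandwidth = 100, 1000    # hypothetical shuffle-network parameters L and B

# Trivial one-round algorithm: a single reducer receives the whole input (M = N).
one_round = simplified_time_bound(R=1, M=N, L=latency, C=N, B=bandwidth)

# Hypothetical 4-round algorithm with small reducers (M = 1000) that
# communicates about N items per round.
few_rounds = simplified_time_bound(R=4, M=1000, L=latency, C=4 * N, B=bandwidth)
```

With these assumed numbers the one-round expression is dominated by the R·M term, while the multi-round expression is split between round overhead and communication, which is why the model bounds M rather than minimizing R alone.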

1.4 Our Contributions 

We provide several efficient algorithms in the MapReduce framework for fundamental combinatorial
problems, including parallel prefix-sum, multi-search, and sorting. All of these algorithms run in $O(\log_M N)$
map-shuffle-reduce rounds with high probability; hence, they are constant-round computations for the case
when M is $\Theta(N^{\varepsilon})$ for some constant $\varepsilon > 0$.

² http://en.wikipedia.org/wiki/Zipf's_law

³ These measures correspond naturally with the time and work bounds used to characterize PRAM algorithms (e.g., see [12]).

Unlike the sorting algorithm in the original paper describing the MapReduce framework [5], our sorting
algorithm is truly parallel, for it does not require a central master node to compute partitioning sequentially.

What is perhaps most unusual about the MapReduce framework is that there is no explicit notion of 
"place" for where data is stored nor for where computations are performed. This property of the MapReduce 
framework is perhaps what led DeWitt and Stonebraker [7] to say that it does not support indexed searches.
Nevertheless, we show that the MapReduce framework does in fact support efficient multi-searching, where 
one is interested in searching for a large number of keys in a search tree of roughly equal size. 

We also provide a number of simulation results. We show that any Bulk-Synchronous Parallel (BSP)
algorithm [17] running in R super-steps with a memory of size N and $P \le N$ processors can be simulated
with a MapReduce algorithm in R rounds and communication complexity $C = O(RN)$ with reducer
I/O-buffers of size $M = O(N/P)$. We also show that any CRCW PRAM algorithm running in T steps with P
processors on a memory of size N can be simulated in the MapReduce framework in $R = O(T \log_M P)$
rounds with $C = O(T(N + P) \log_M (N + P))$ communication complexity. This latter simulation result
holds for any version of the CRCW PRAM model, including the f-CRCW PRAM, which involves the
computation of a commutative semigroup operator f on concurrent writes to the same memory location,
such as in the Sum-CRCW PRAM [8]. The PRAM simulation results achieve their efficiency through the
use of a technique we call the invisible funnel method, as it can be viewed as placing virtual multi-way trees
rooted at the input items. These trees funnel concurrent read and write requests to the data items, but are
never explicitly constructed. The simulation results can be applied to solve several parallel computational
geometry problems, including convex hulls and fixed-dimensional linear programming.

Roadmap. The rest of the paper is organized as follows. In Section 2, we first present our generic
MapReduce framework, which simplifies the development and exposition of the algorithms that follow. In Section 3,
we show how to simulate BSP and CRCW PRAM algorithms in the MapReduce framework. Finally, in
Section 4, we design MapReduce algorithms for multi-search and sorting.

2 Generic MapReduce Computations 

In this section we define an abstract computational model that captures the MapReduce framework. 

Consider a set of nodes V. Let $A_v(r)$ be a set of items associated with each node $v \in V$ at round r,
which defines the state of v. Also, let f be a sequential function defined for all nodes. Function f takes as
input the state $A_v(r)$ of a node v and returns a new set $B_v(r)$, in the process destroying $A_v(r)$. Each item of
$B_v(r)$ is of the form $(w, a)$, where $w \in V$ and a is a new item. We define the following computation which
proceeds in R rounds.

At the beginning of the computation only the input nodes v have non-empty states $A_v(0)$. The state of
an input node consists of a single input item.

In round r, each node v with non-empty state $A_v(r) \ne \emptyset$ performs the following. First, v applies
function f on $A_v(r)$. This results in the new set $B_v(r)$ and deletion of $A_v(r)$. Then, for each element
$b = (w, a) \in B_v(r)$, node v sends item a to node w. Note that if $w = v$, then v sends a back to itself. As a
result of this process, each node may receive a set of items from others. Finally, the set of received items at
each node v defines the new state $A_v(r + 1)$ for the next round. The items comprising the non-empty states
$A_v(r)$ after R rounds define the outputs of the entire computation, at which point the computation halts.

The number of rounds R denotes the round complexity of the computation. The total number of all the
items sent (or, equivalently, received) by the nodes in each round r defines the communication complexity $C_r$
of round r, that is, $C_r = \sum_{v} |B_v(r)|$. Finally, the communication complexity C of the entire computation is
defined as $C = \sum_{r=0}^{R-1} C_r = \sum_{r=0}^{R-1} \sum_{v} |B_v(r)|$. Note that this definition implies that nodes v whose states
$A_v(r)$ are empty at the beginning of round r do not contribute to the communication complexity. Thus, the
set V of nodes can be infinite. But, as long as only a finite number of nodes have non-empty $A_v(r)$ at the
beginning of each round, the communication complexity of the computation is bounded.
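
The round-by-round process just described can be sketched as a small sequential simulator. This is an illustration under assumptions of ours: node labels are Python tuples, and the function f is given the label v and round number r in addition to the state, which the text permits when links are encoded in f. The names `run_generic` and `tree_sum` are hypothetical.

```python
from collections import defaultdict


def run_generic(initial_states, f, rounds):
    """Simulate the generic computation for a fixed number of rounds.

    initial_states maps a node label v to its list of items A_v(0).
    f(v, r, items) returns B_v(r) as a list of (w, a) pairs.
    Returns the final states and the per-round communication C_r.
    """
    states = {v: list(items) for v, items in initial_states.items() if items}
    comm = []
    for r in range(rounds):
        inbox = defaultdict(list)
        sent = 0
        for v, items in states.items():      # only non-empty states take part
            for w, a in f(v, r, items):      # B_v(r) replaces A_v(r)
                inbox[w].append(a)           # item a is sent to node w
                sent += 1
        states = dict(inbox)                 # received items form A_v(r + 1)
        comm.append(sent)                    # C_r = sum over v of |B_v(r)|
    return states, comm


def tree_sum(v, r, items):
    """Hypothetical f: inputs send to a group node, groups send to a root."""
    if v[0] == "in":
        return [(("grp", v[1] // 2), items[0])]
    if v[0] == "grp":
        return [(("root", 0), sum(items))]
    return [(v, sum(items))]                 # the root keeps its total


initial = {("in", i): [i + 1] for i in range(4)}   # input items 1, 2, 3, 4
final_states, comm = run_generic(initial, tree_sum, rounds=3)
```

Nodes that never receive an item never appear in `states`, which mirrors the observation that V may be infinite as long as finitely many nodes are non-empty in each round.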

Observe that during the computation, in order for node v to send items to node w in round r, v should
know the label of the destination w, which can be obtained by v in the following possible ways (or any
combination thereof): 1) the link $(v, w)$ can be encoded in f as a function of the label of v and round r, 2)
some node might send the label of w to v in the previous round, or 3) node v might keep the label of w as
part of its state by constantly sending it to itself.

Thus, the above computation can be viewed as a computation on a dynamic directed graph $G = (V, E)$,
where an edge $(v, w) \in E$ in round r represents a possible communication link between v and w during that
round. The encoding of edges $(v, w)$ as part of function f is equivalent to defining an implicit graph [13];
keeping all edges within a node throughout the computation is equivalent to defining a static graph. For ease
of exposition, we define the following primitive operations that can be used within f at each node v:

• create an item; delete an item; modify an item; keep item x (that is, the item x will be sent to v itself
by creating an item $(v, x) \in B_v(r)$); send an item x to node w (create an item $(w, x) \in B_v(r)$).

• create an edge; delete an edge. This is essentially the same as create an item and delete an item, since
explicit edges are just maintained as items at nodes. These operations will simplify exposition when
dealing with explicitly defined graphs G on which computation is performed.

The following theorem shows that the above framework captures the essence of computation in the 
MapReduce framework: 

Theorem 2.1: Let $G = (V, E)$ and f be defined as above such that in each round each node $v \in V$ sends,
keeps and receives at most M items. Then computation on G with round complexity R and communication
complexity C can be simulated in the I/O-memory-bound MapReduce model with the same round and
communication complexities.

Proof: We implement round $r = 0$ of computation on G in the I/O-memory-bound MapReduce framework
using only the Map and Shuffle steps and every round $r > 0$ using the Reduce step of round $r - 1$ and a
Map and Shuffle step of round r.

1. Round $r = 0$: (a) Computing $B_v(r) = f(A_v(r))$: Initially, only the input nodes have non-empty sets
$A_v(r)$, each of which contains only a single item. Thus, the output $B_v(r)$ only depends on a single
item, fulfilling the requirement of Map. We define Map to be the same as f, i.e., it outputs a set of
key-value tuples $(w, x)$, each of which corresponds to an item $(w, x)$ in $B_v(r)$. (b) Sending items to
destinations: The Shuffle step on the output of the Map step ensures that all tuples with key w will be
sent to the same reducer, which corresponds to the node w in G.

2. Round $r > 0$: First, each reducer v that receives a tuple $(v; x_1, x_2, \ldots, x_k)$ (as a result of the Shuffle
step of the previous round) simulates the computation at node v in G. That is, it simulates the function
f and outputs a set of tuples $(w, x)$, each of which corresponds to an item in $B_v(r)$. We then define
Map to be the identity map: On input $(w, x)$, output key-value pair $(w, x)$. Finally, the Shuffle step
of round r completes the simulation of the round r of computation on graph G by sending all tuples
with key w to the same reducer that will simulate node w in G in round $r + 1$.

Keeping an item is equivalent to sending it to itself; thus, each node in G sends and receives at most M
items. Therefore, no reducer receives or generates more than M items, implying that the above is a correct
I/O-memory-bound MapReduce algorithm. ■

The above theorem gives an abstract way of designing MapReduce algorithms. More precisely, to design
a MapReduce algorithm, we define graph G and a sequential function f to be performed at each node $v \in V$.
This is akin to designing BSP algorithms and is more intuitive than defining Map and Reduce functions.

Note that in the above framework we can easily implement a global loop primitive spanning over mul- 
tiple rounds: Each item maintains a counter that is updated at each round. We can also implement parallel 
tail recursion by defining the labels of nodes to include the recursive call stack identifiers. 

Next, we show how we can implement an all-prefix-sum algorithm in the generic MapReduce model.
This algorithm will then be used as a subroutine in a random indexing algorithm, which in turn will be used
in the multi-search algorithm in Section 4.1.



2.1 Prefix Sums and Random Indexing 

The all-prefix-sum problem is usually defined on an array of integers. Since there is no notion of arrays in
the MapReduce framework, but rather a collection of items, we define the all-prefix-sum problem as follows:
given a collection of items $x_i$, where $x_i$ holds an integer $a_i$ and an index value $0 \le i \le N - 1$, compute for
each item $x_i$ a new value $b_i = \sum_{j=0}^{i} a_j$.

The MapReduce algorithm for the all-prefix-sum problem is the following. Graph $G = (V, E)$ is an
undirected⁴ rooted tree T with branching factor $d = M/2$ and height $L = \lceil \log_d N \rceil = \Theta(\log_M N)$. The root of
the tree is defined to be at level 0 and leaves at level $L - 1$. We label the nodes in T such that the k-th node
(counting from the left) on level l is defined as $v = (l, k)$. Then, we can identify the parent of a non-root
node $v = (l, k)$ as $p(v) = (l - 1, \lfloor k/d \rfloor)$ and the j-th child of v as $w_j = (l + 1, k \cdot d + j)$. In other words,
the neighborhood set of any node $v \in T$ can be computed solely from the label of v; thus, we do not have
to maintain edges explicitly.
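
The labeling scheme can be written down directly. The following is a minimal sketch in Python with labels as `(level, k)` tuples; the function names are ours.

```python
def parent(label, d):
    """Return the label of the parent of a non-root node (l, k)."""
    level, k = label
    if level == 0:
        raise ValueError("the root has no parent")
    return (level - 1, k // d)


def child(label, j, d):
    """Return the label of the j-th child (0 <= j < d) of node (l, k)."""
    level, k = label
    return (level + 1, k * d + j)


def children(label, d):
    """Return the labels of all d children in left-to-right order."""
    return [child(label, j, d) for j in range(d)]
```

Since both directions are computed from the label and d alone, a node needs no stored adjacency information to route an item to its parent or to a child.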

In the initialization step, each input node simply sends its input item $a_i$ with index i to the leaf node
$v = (L - 1, i)$. The rest of the algorithm proceeds in two phases, processing the nodes in T one level at a
time. The nodes at other levels simply keep the items they have received during previous rounds.

1. Bottom-up phase. For $l = L - 1$ downto 1 do: For each node v on level l do: If v is a leaf node, it
received a single value $a_i$ from an input node. The function f at v creates a copy $s_v = a_i$, keeps the $a_i$ it had
received and sends $s_v$ to the parent $p(v)$ of v. If v is a non-leaf node, let $w_0, w_1, \ldots, w_{d-1}$ denote v's
child nodes in the left-to-right order. Node v received a set of d items $A_v(r) = \{s_{w_0}, s_{w_1}, \ldots, s_{w_{d-1}}\}$
from its children at the end of the previous round. $f(A_v(r))$ computes the sum $s_v = \sum_{j=0}^{d-1} s_{w_j}$, sends
$s_v$ to $p(v)$ and keeps all the items received from its children.

2. Top-down phase. For $l = 0$ to $L - 1$ do: For each node v on level l do: If v is the root, it had received
items $A_v(r) = \{s_{w_0}, s_{w_1}, \ldots, s_{w_{d-1}}\}$ at the end of the bottom-up phase. It creates for each child
$w_i$ ($0 \le i \le d - 1$) a new item $s'_i = \sum_{j=0}^{i-1} s_{w_j}$ and sends it to $w_i$. If v is a non-root node, let $s_{p(v)}$ be
the item received from its parent in the previous round. Inductively, the value $s_{p(v)}$ is the sum of all
items "to the left" of v. If v is a leaf having a unique item $a_k$, then it simply outputs $a_k + s_{p(v)}$ as a
final value, which is the prefix sum $\sum_{j=0}^{k} a_j$. Otherwise, it creates for each child $w_i$ ($0 \le i \le d - 1$)
a new item $s_{p(v)} + \sum_{j=0}^{i-1} s_{w_j}$ and sends it to $w_i$. In all cases, all items of v are deleted.
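
The two phases above can be sketched as a sequential Python simulation that stores one partial sum per tree node. Two assumptions of ours differ from the text: the number of levels is chosen as the smallest value for which the leaf level can hold every input, and nodes whose subtrees contain no input are simply absent. The name `tree_prefix_sums` is hypothetical.

```python
def tree_prefix_sums(a, d):
    """Compute inclusive prefix sums of the list a using a d-ary tree."""
    if d < 2:
        raise ValueError("branching factor must be at least 2")
    n = len(a)
    levels = 1                         # leaves live on level levels - 1
    while d ** (levels - 1) < n:
        levels += 1

    # Bottom-up phase: s[(l, k)] is the sum of the inputs below node (l, k).
    s = {(levels - 1, i): x for i, x in enumerate(a)}
    for level in range(levels - 1, 0, -1):
        for (lv, k), value in list(s.items()):
            if lv == level:
                p = (level - 1, k // d)          # parent label
                s[p] = s.get(p, 0) + value

    # Top-down phase: left[v] is the sum of all inputs to the left of v.
    left = {(0, 0): 0}
    for level in range(0, levels - 1):
        for (lv, k), base in list(left.items()):
            if lv != level:
                continue
            running = base
            for j in range(d):
                w = (level + 1, k * d + j)       # j-th child label
                if w in s:
                    left[w] = running
                    running += s[w]

    # Each leaf outputs its own value plus everything to its left.
    return [left[(levels - 1, i)] + a[i] for i in range(n)]


prefix = tree_prefix_sums([3, 1, 4, 1, 5, 9, 2, 6], d=3)
```

In the MapReduce setting each level of either loop corresponds to one round, and every node touches at most d partial sums, which is what keeps the per-reducer I/O within M.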

Lemma 2.2: Given an indexed collection of N numbers, we can compute all prefix sums in the I/O-memory-bound
MapReduce framework in $O(\log_M N)$ rounds and $O(N \log_M N)$ words of communication.

Proof: The fact that the algorithm correctly computes all prefix sums is by induction on the values $s_{p(v)}$.
In each round, each node sends and receives at most M items, fulfilling the condition of Theorem 2.1. The
total number of rounds is $2L = \Theta(\log_M N)$ plus the initial round of sending input elements to the leaves of
T. The total number of items sent in each round is dominated by items sent by N leaves, which is $O(N)$
per round. Applying Theorem 2.1 completes the proof. ■

⁴ Each undirected edge is represented by two directed edges in G.

Quite often, the input to the MapReduce computation is a collection of items with no particular ordering
or indexing. If each input element is annotated with an estimate $\hat{N}$ of the size of the input, with
$N \le \hat{N} \le N^{c}$ for some constant $c > 1$, then using the all-prefix-sum algorithm we can generate a random
indexing for the input with high probability.

We modify the all-prefix-sum algorithm above as follows. We define the tree T on $\hat{N}^3$ leaves; thus, the
height of the tree is $L = \lceil 3 \log_d \hat{N} \rceil$. In the initialization step, each input node picks a random index i in the
range $[0, \hat{N}^3 - 1]$ and sends $a_i = 1$ to the leaf node $v = (L - 1, i)$ of T. Let $n_v$ be the number of items that
leaf v receives. Note it is possible that $n_v > 1$; thus, we perform the all-prefix-sums computation with the
following differences at the leaf nodes. During the bottom-up phase, we define $s_v = n_v$ at the leaf node v.
At the end of the top-down phase, each leaf v assigns each of the items that it received from the input nodes
the indices $s_{p(v)} + 1, s_{p(v)} + 2, \ldots, s_{p(v)} + n_v$ in a random order, which is the final output of the computation.
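
A sequential sketch of this random indexing step is given below. As a simplification of ours, the tree-based prefix sums over the leaf counts are replaced by a running offset over the occupied leaves in sorted order, which yields the same indices; the number of leaves is taken to be the cube of the size estimate. The function name and the `rng` parameter are hypothetical.

```python
import random


def random_indexing(items, n_hat, rng=None):
    """Assign the indices 1..N to the items in a random order.

    n_hat is an estimate of the input size with n_hat >= len(items).
    Returns a list of (index, item) pairs.
    """
    if rng is None:
        rng = random.Random()
    num_leaves = n_hat ** 3

    # Initialization: every item picks a random leaf independently.
    leaves = {}
    for x in items:
        leaves.setdefault(rng.randrange(num_leaves), []).append(x)

    # Prefix sums over leaf counts n_v, replaced here by a running offset
    # over the occupied leaves in left-to-right order.
    result = []
    offset = 0
    for leaf in sorted(leaves):
        group = leaves[leaf]
        rng.shuffle(group)                       # random order inside a leaf
        for rank, x in enumerate(group, start=1):
            result.append((offset + rank, x))    # indices offset+1 .. offset+n_v
        offset += len(group)
    return result


indexed = random_indexing(["a", "b", "c", "d", "e"], n_hat=5, rng=random.Random(0))
```

Because the leaf range is much larger than the input, collisions at a leaf are rare, so the final shuffle inside a leaf usually handles a single item.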

Lemma 2.3: A random indexing of the input can be performed on a collection of data in the I/O-memory-bound
MapReduce framework in $O(\log_M N)$ rounds and $O(N \log_M N)$ words of communication with high
probability.

Proof: First, note that the probability that $n_v > M$ at some leaf vertex is at most $N^{-\Omega(M)}$. Thus, with
probability at least $1 - N^{-\Omega(M)}$, no leaf and, consequently, no node of T receives more than $O(M)$ elements.
Second, note that at most N leaves of the tree T have $A_v(r) \ne \emptyset$. Since we do not maintain the edges of
the tree explicitly, the total number of items sent in each round is again dominated by the items sent by at
most N leaves, which is $O(N)$ per round. Finally, the round and communication complexity follows from
Lemma 2.2. ■



3 Simulating BSP and CRCW PRAM Algorithms 

In this section we show how to simulate BSP and CRCW PRAM algorithms in the MapReduce framework.
Our methods therefore provide extensions of the simulation result of Karloff et al. [14], who show how to
optimally simulate any EREW PRAM algorithm in the MapReduce framework.⁵

3.1 Simulating BSP algorithms 

In the BSP model [17], the input of size N is distributed among P processors so that each processor contains
at most $M = \lceil N/P \rceil$ input items. A computation is specified as a series of super-steps, each of which
involves each processor performing an internal computation and then sending a set of up to M messages to
other processors.

The initial state of the BSP algorithm is an indexed set of processors $\{p_1, p_2, \ldots, p_P\}$ and an indexed
set of initialized memory cells $\{m_{1,1}, m_{1,2}, \ldots, m_{P,m}\}$, such that $m_{i,j}$ is the j-th memory cell assigned to
processor i. Since our framework is almost equivalent to the BSP model, the simulation is straightforward:

• Each processor $p_i$ ($1 \le i \le P$) defines a node $v_i$ in our generic MapReduce graph G, and the
internal state $\pi_i$ of $p_i$ and its memory cells $\{m_{i,1}, \ldots, m_{i,m}\}$ define the items $A_{v_i}$ of node $v_i$. In the
BSP algorithm, in each super-step each processor $p_i$ performs a series of computations, updates its
internal state and memory cells to $\pi'_i$ and $\{m'_{i,1}, \ldots, m'_{i,m}\}$, and sends a set of messages $\mu_{j_1}, \ldots, \mu_{j_k}$
to processors $p_{j_1}, \ldots, p_{j_k}$, where the total size of all messages sent or received by a processor is
at most M. In our MapReduce simulation, function f at node $v_i$ performs the same computation,
modifies items $\{\pi_i, m_{i,1}, \ldots, m_{i,m}\}$ to $\{\pi'_i, m'_{i,1}, \ldots, m'_{i,m}\}$ and sends items $\mu_{j_1}, \ldots, \mu_{j_k}$ to nodes
$v_{j_1}, \ldots, v_{j_k}$.

⁵ Their original proof was identified for the CREW PRAM model, but there was a flaw in that version, which could violate the
I/O-buffer-memory size constraint during a CREW PRAM simulation. Based on a personal communication, we have learned that
the subsequent version of their paper will identify their proof as being for the EREW PRAM.

Theorem 3.1: Given a BSP algorithm A that runs in R super-steps with a total memory size N using
$P \le N$ processors, we can simulate A using $O(R)$ rounds and $C = O(RN)$ communication in the
I/O-memory-bound MapReduce framework with reducer memory size bounded by $M = \lceil N/P \rceil$.

Applications. By Theorem 3.1, we can directly simulate BSP algorithms for sorting [11] and convex
hulls [10], achieving, for each problem, $\Theta(\log_M N)$ rounds and $O(N \log_M N)$ communication complexity.

In Section 4.3, we will present a randomized sorting algorithm, which has the same complexity but is
simpler than directly simulating the complicated BSP algorithm in [11].

3.2 Simulating CRCW PRAM algorithms 

In this section we present a simulation of any f-CRCW PRAM model, the strongest variant of the PRAM
model, where concurrent writes to the same memory location are resolved by applying a commutative
semigroup operator f on all values being written to the same memory address, such as Sum, Min, Max, etc.

The input to our simulation of a PRAM algorithm A assumes that the input is specified by an indexed set
of P processor items, $p_1, \ldots, p_P$, as well as an indexed set of initialized PRAM memory cells, $m_1, \ldots, m_N$,
where N is the total memory size used by A.

The main challenge in simulating the algorithm A in the MapReduce model is that there may be as many
as P reads and writes to the same memory cell in any given step and P can be significantly larger than M,
the memory size of reducers. Thus, we need to have a way to "fan in" these reads and writes. We accomplish
this by using the invisible funnel technique, where we imagine that there is a different implicit $O(M)$-ary tree
rooted at each memory cell that has the set of processors as its leaves. Intuitively, our simulation algorithm
involves routing reads and writes up and down these N trees. We view them as "invisible", because we do
not actually maintain them explicitly, since that would require $\Theta(PN)$ additional memory cells.

The invisible funnels constructed here are similar to the one constructed for computing random
indexing in Section 2.1, each of which is a multi-way tree with fan-out parameter $d = M/2$ and height
$L = \lceil \log_d P \rceil = O(\log_M P)$. Recall that according to our labeling scheme, given a node $v = (j, l, k)$, the
k-th node on level l of the j-th tree, we can uniquely identify the label of its parent $p(v)$ and each of its d
children.

We view the computation specified in a single step in the algorithm A as being composed of a read
sub-step, followed by a constant-time internal computation, followed by a write sub-step. At the initialization
step, we send $m_j$ to the root node of the j-th tree, i.e., $m_j$ is sent to node $(j, \mathrm{root}) = (j, 0, 0)$. For each
processor $p_i$ ($1 \le i \le P$), we send items $p_i$ and $\pi_i$ to node $u_i$, where $\pi_i$ is the internal state of processor
$p_i$. Again, throughout the algorithm, each node keeps the items that it has received in previous rounds until
they are explicitly deleted.

1. Bottom-up read phase. For each processor $p_i$ that attempts to read memory location $m_j$, node $u_i$ sends
an item encoding a read request (in the following we simply say a read request) to the $i$-th leaf node
of the $j$-th tree, i.e., to node $(j, L-1, i)$, indicating that it would like to read the contents of the $j$-th
memory cell.

For $l = L-1$ downto 1 do:

• For each node $v$ at level $l$, if it received read request(s) in the previous round, then it sends a read
request to its parent $p(v)$.

2. Top-down read phase. The root node in the $j$-th tree sends the value $m_j$ to child $(j, w_k)$ if child $w_k$
has sent a read request at the end of the bottom-up read phase.

For $l = 1$ to $L-2$ do:

• For each node $v$ at level $l$, if it received $m_j$ from its parent in the previous round, then it sends
$m_j$ to all those children that sent $v$ read requests during the bottom-up read phase. After
that, $v$ deletes all of its items.

Each leaf $v$ sends $m_j$ to the node $u_i$ ($1 \le i \le P$) if $u_i$ sent $v$ a read request at the beginning
of the bottom-up read phase. After that, $v$ deletes all of its items.

3. Internal computation phase. At the end of the top-down phase, each node $u_i$ receives its requested
memory item $m_j$, performs the internal computation, and then sends an item $z$ encoding a write
request to the node $(j, L-1, i)$ if processor $p_i$ wants to write $z$ to the memory cell $m_j$.

4. Bottom-up write phase. For $l = L-1$ downto 0 do:

• For each node $v$ at level $l$, if it received write request(s) in the previous round, let $z_1, \ldots, z_k$ ($k \le d$)
be the items encoding those write requests. If $v$ is not a root, it applies the semigroup function
on input $z_1, \ldots, z_k$, sends the result $z'$ to its parent, and then deletes all of its items. Otherwise,
if $v$ is a root, it modifies its current memory item to $z'$.
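
The fan-in idea behind the bottom-up write phase can be illustrated for a single memory cell. The Python sketch below is a sequential stand-in, not the MapReduce implementation: it assumes 0-based processor positions, uses integer division by $d$ to find a parent, and does not reproduce the exact level numbering of the text. The names `tree_height` and `combine_writes` are hypothetical.

```python
from functools import reduce


def tree_height(P, d):
    """Smallest h with d**h >= P, computed with integers to avoid float logs."""
    h, span = 0, 1
    while span < P:
        span *= d
        h += 1
    return h


def combine_writes(requests, P, d, op):
    """Funnel write requests for ONE memory cell up an implicit d-ary tree.

    requests: dict mapping a 0-based processor index (< P) to the value it writes.
    op: a commutative, associative function (the semigroup function).
    Returns (value reaching the root, number of rounds); the value is None
    if no processor issued a write.
    """
    level = dict(requests)              # position on the current level -> pending value
    rounds = 0
    for _ in range(tree_height(P, d)):
        received = {}
        for k, z in level.items():      # every node forwards to its parent k // d
            received.setdefault(k // d, []).append(z)
        # each parent combines the at most d items it received
        level = {k: reduce(op, zs) for k, zs in received.items()}
        rounds += 1
    return level.get(0), rounds
```

With $P = 16$, $d = 4$ and `max` as the semigroup function, writes of 5, 9 and 2 from processors 0, 3 and 10 reach the root as 9 after two rounds, and no node ever handles more than $d$ items.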

When we have completed the bottom-up write phase, we are inductively ready for simulating the next 
step in the PRAM algorithm. We have the following. 

Theorem 3.2: Given an algorithm A in the CRCW PRAM model, with write conflicts resolved according
to a commutative semigroup function such that A runs in $T$ steps using $P$ processors and $N$ memory cells,
we can simulate A in the I/O-memory-bound MapReduce framework in $R = \Theta(T \log_M P)$ rounds and
with $C = O(T(N + P) \log_M(N + P))$ communication complexity.

Proof: Each round in the CRCW PRAM algorithm is simulated by $\Theta(\log_M P)$ rounds in the
I/O-memory-bound MapReduce algorithm, and the total number of items sent is $O(N)$ per round. ■



Applications. By Theorem 3.2 we can directly simulate any CRCW (thus, also CREW) PRAM algorithm, 
in particular, linear programming in fixed dimensions by Alon and Megiddo O. The simulation achieves 
$\Theta(\log_M N)$ rounds and $O(N \log_M N)$ communication complexity.

4 Multi-searching and Sorting



In this section, we present a method for performing simultaneous searches on a balanced search tree data
structure. Let $T$ be a balanced binary search tree and $Q$ be a set of queries. Let $N = |T| + |Q|$. The problem
of multi-search asks to annotate each query $q \in Q$ with a leaf $v \in T$, such that the root-to-leaf search path
for $q$ in $T$ terminates at $v$.

Goodrich [10] provides a solution to the multi-search problem in the BSP model. However, directly 
simulating the BSP algorithm in the I/O-memory-bound MapReduce model has two issues. 

First, the model used in [10] is a non-standard BSP model, for it allows a processor to keep an unlimited
number of items between rounds while still requiring each processor to send and receive at most $\lceil N/P \rceil = M$
items. However, a closer inspection of [10] reveals that the probability that some processor will contain
more than $M$ items in some round is at most $N^{-c}$ for any constant $c > 1$. Therefore, with high probability
it can still be simulated in our MapReduce framework.

Second, the BSP solution requires $O(N \log_M N)$ space. Thus, Theorem 3.1 provides us with a MapReduce
algorithm with communication complexity $O(N \log_M^2 N)$. In this section we improve this communication
complexity to $O(N \log_M N)$, while still achieving $O(\log_M N)$ round complexity with high probability.

In Section 4.2 we also describe a queuing strategy which reduces the probability of failure due to the first
issue of the simulation. The queuing algorithm might also be of independent interest because it removes
some of the requirements of the framework of Section 2.

4.1 Multi-searching 

As mentioned before, with high probability we can simulate the BSP algorithm of Goodrich [10] in the
MapReduce model in $R = O(\log_M N)$ rounds and $C = O(N \log_M^2 N)$ communication complexity. In this section
we present a solution to reduce the communication complexity by a $\Theta(\log_M N)$ factor.

The main reason for the large communication complexity of the simulation is the $O(N \log_M N)$ size of
the search structure that the BSP algorithm constructs to relieve the congestion caused by multiple queries
passing through the same node of the search tree. It is worth noting that if the number of queries is small
relative to the size of the search tree, that is, if $|Q| \le N / \log_M N$, then the size of the BSP search
structure is only linear and we can perform the simulation of the algorithm with $O(N \log_M N)$ communication
complexity. Thus, for the remainder of this section we assume $|Q| > N / \log_M N$.

Consider a MapReduce algorithm A that simulates the BSP algorithm for a smaller set of queries,
namely $Q'$ of size only $\lceil N / \log_M N \rceil$. Given a search tree $T$, algorithm A converts $T$ into a directed acyclic
graph (DAG) $G$ (see [10] for details). $G$ has $\log_M N$ levels and $O(N / \log_M N)$ nodes in each level (thus
$O(N / \log_M N)$ source nodes). Therefore the size of $G$ is $O(N)$. Next, A propagates the queries of $Q'$
through $G$. In each round, with high probability, all queries are routed one level down in $G$. Thus, the round
complexity of A is still $\Theta(\log_M N)$ while the communication complexity is $O(N \log_M N)$.

To solve the multi-search problem on the input set $Q$, we make use of A as follows. We partition the
set of queries $Q$ into $\log_M N$ random subsets $Q_1, Q_2, \ldots, Q_{\log_M N}$, each containing $O(N / \log_M N)$ queries.
Next, we construct $G$ for the query set $Q_1$ and also use it to propagate the rest of the query sets. In particular,
we proceed in $O(\log_M N)$ rounds. In each of the first $\log_M N$ rounds we feed a new subset $Q_i$ of queries to
the $O(N / \log_M N)$ source nodes of $G$ and propagate the queries down to the sinks using algorithm A. This
approach can be viewed as a pipelined execution of $\log_M N$ multi-searches on $G$.
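
To see why the pipelined execution still takes only $O(\log_M N)$ rounds, one can tabulate which batch occupies which level of $G$ in each round. The Python sketch below assumes the high-probability event that every batch advances exactly one level per round; the function name and the 0-based numbering of batches, levels and rounds are illustrative choices, not part of the algorithm.

```python
def pipeline_schedule(num_batches, num_levels):
    """List, for every round r, the (batch, level) pairs that are active.

    Batch b enters the source level in round b and moves down one level per
    round, so during round r it sits at level r - b.
    """
    total_rounds = num_batches + num_levels - 1
    schedule = []
    for r in range(total_rounds):
        active = [(b, r - b) for b in range(num_batches)
                  if 0 <= r - b < num_levels]
        schedule.append(active)
    return schedule
```

With $k = \log_M N$ batches and $k$ levels the schedule has $2k - 1$ rounds, and in every round each level holds at most one batch, which is why a single copy of $G$ suffices.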

We implement random partitioning of $Q$ by performing a random indexing for $Q$ (Lemma 2.3) and
assigning the query with index $j$ to subset $Q_{\lceil j / \log_M N \rceil}$. A node $v$ containing a query $q \in Q_i$ keeps $q$ (by
sending it to itself) until round $i$, at which point it sends $q$ to the appropriate source node of $G$.

Theorem 4.1: Given a binary search tree $T$ of size $N$, we can perform a multi-search of $N$ queries over
$T$ in the I/O-memory-bound MapReduce model in $O(\log_M N)$ rounds with $O(N \log_M N)$ communication
with high probability.

Proof: We sketch the proof here. Let $L_1, \ldots, L_{\log_M N}$ be the $\log_M N$ levels of nodes of $G$. First, all query
items in the first query batch $Q_1$ can pass (i.e., be routed down) $L_j$ ($1 \le j \le \log_M N$) in one round with
high probability. This is because for each node $v$ in $L_j$, at most $M$ query items of $Q_1$ will be routed to $v$ with
probability at least $1 - N^{-c}$ for any constant $c$. By taking the union bound over all the nodes in $L_j$, we have that with
probability at least $1 - O(N / \log_M N) \cdot N^{-c}$, $Q_1$ can pass $L_j$ in one round. Similarly, we can prove that
any $Q_i$ ($1 \le i \le \log_M N$) can pass $L_j$ ($1 \le j \le \log_M N$) in one round with the same probability, since the sets
$Q_i$ have equal distributions. Since there are $\log_M N$ batches of queries and they are fed into $G$ in a pipelined
fashion, by the union bound we have that with probability at least
$1 - \log_M N \cdot O(N / \log_M N) \cdot N^{-c} = 1 - O(N^{1-c})$
(which approaches 1 by choosing a sufficiently large constant $c$) the whole process completes within $O(\log_M N)$ rounds. The
communication complexity follows directly since we only need to send $O(|G| + |Q|) = O(N)$ items in
each round. ■



4.2 FIFO Queues in MapReduce Model 

As mentioned at the beginning of this section, with probability $1 - N^{-c}$ for any constant $c > 1$ no processor
in the BSP algorithm for multi-searching contains more than $M$ items. Thus, the algorithm for multi-search
in the previous section can be implemented in the I/O-memory-bound MapReduce framework with high
probability.

However, the failure of the algorithm implies a crash of a reducer in the MapReduce framework, which 
is quite undesirable. In this section we present a queuing strategy which ensures that no reducer receives 
more than M items, which might be of independent interest. 

Consider the following modified version of the generic MapReduce framework from Section 2. In this
version we still require each node $v$ to send at most $M$ items. However, instead of limiting the number
of items that a node keeps or receives to be $M$, we only require that in every round at most $M$ different
nodes send to any given node $v$, and function $f$ takes as input a list of at most $M$ items. To accommodate
the latter requirement, if a node receives or contains more than $M$ items, the excess items are kept within
the node's input buffer and are fed into function $f$ in batches of $O(M)$ items per round in a first-in-first-out
(FIFO) order.
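
The buffering behavior of a node in this modified framework can be sketched as follows. This is an illustrative Python model only: the class name `BufferedNode` is hypothetical, each batch is taken to hold exactly $M$ items (the text allows $O(M)$), and the limit on the number of distinct senders is not modeled.

```python
from collections import deque


class BufferedNode:
    """A node whose function f sees at most M items per round.

    Items beyond the first M wait in the input buffer and are
    released in first-in-first-out order in later rounds.
    """

    def __init__(self, M, f):
        self.M = M
        self.f = f
        self.buffer = deque()

    def receive(self, items):
        # New arrivals join the back of the queue.
        self.buffer.extend(items)

    def run_round(self):
        # Feed the oldest (at most M) buffered items to f.
        size = min(self.M, len(self.buffer))
        batch = [self.buffer.popleft() for _ in range(size)]
        return self.f(batch)
```

A node with $M = 3$ that receives seven items in one round therefore processes them over three rounds, in arrival order.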

In this section we show that any algorithm A with round complexity $R$ and communication complexity $C$
in the modified framework can be implemented using the framework in Section 2 with the same asymptotic
round and communication complexities.

We simulate algorithm A by implementing the FIFO queue at each node $v$ by a doubly-linked list $L_v$ of
nodes, such that $L_v \cap L_w = \emptyset$ for all $v \ne w$ and $L_v \cap V = \emptyset$ for all $v \in V$. Each node $v \in V$ keeps a pointer
$head_v$ to the head of its list $L_v$. In addition, $v$ also keeps $n_{head}$, the number of query items at $head_v$. If
$L_v$ is empty, $head_v$ points at $v$ and $n_{head} = 0$. Throughout the algorithm we maintain an invariant that for
each doubly-linked list $L_v$, each node in $L_v$ contains between $M/4$ and $M/2$ query items except the head node, i.e.,
the one containing the last items to be processed in the queue, which contains at most $M/2$ query items.
We simulate one round of A by the following three rounds. Let $IN(v)$ and $OUT(v)$ denote the sets of in-
and out-neighbors of node $v \in V$, respectively. That is, for each $u \in IN(v)$, $(u, v) \in E$ and for each
$w \in OUT(v)$, $(v, w) \in E$.

R1. Each node $u \in V$ that wants to send $n_{u,v}$ query items to $v \in OUT(u)$, instead of sending the actual
query items, sends $n_{u,v}$ to $v$.

R2. Each node $v \in V$ receives a set of different values $n_{u_1,v}, n_{u_2,v}, \ldots, n_{u_k,v}$ from its in-neighbors
$u_1, u_2, \ldots, u_k$ ($k \le M$). For convenience we define $n_{u_0,v} = n_{head}$. Next, $v$ partitions the set
$\{0, 1, \ldots, k\}$ into sets $S_1, \ldots, S_m$, $m \le k$, such that $M/4 \le \sum_{j \in S_i} n_{u_j,v} \le M/2$ for all $1 \le i \le m-1$
and $\sum_{j \in S_m} n_{u_j,v} \le M/2$. W.l.o.g., assume that $0 \in S_1$. For each $S_i$, we will have a corresponding
node in the list $L_v$: We let $w_1 = head_v$ and for each $S_i$, $2 \le i \le m$, we pick a new
node $w_i$, create edges $(w_i, w_{i-1})$ and $(w_{i-1}, w_i)$, and send them to nodes $w_i$ and $w_{i-1}$, respectively. For
each $j \in S_i$, we also notify $u_j$ that it should send all its queries to $w_i$ by sending the label of $w_i$ to $u_j$.
The only exception to this rule is if $w_1 \ne v$ and $w_1$ contains the edge $(w_1, v)$, i.e., it is the first
node in $L_v$. In this case, for each $j \in S_1$ each $u_j$ should send queries directly to $v$. Finally, we update
the pointer $head_v$ to point to $w_m$ and update $n_{head} = \sum_{j \in S_m} n_{u_j,v}$, unless $w_m = v$, in which case
$n_{head} = 0$.

R3. Each node $u_j \in IN(v)$ receives the label of a node $w_i$ from $v$ in the previous round. It sends all its
query items to $w_i$. Note that if $w_i = v$, all items will be sent to $v$ directly. At the same time, each node
$w \notin V$, i.e., $w \in L_v$, that has an edge $(w, v)$ for some $v \in V$ sends all its items to $v$ and extracts itself
from the list. The node $w$ accomplishes this by deleting all edges incident to $w$ and by sending to its
predecessor $pred(w)$ in the queue $L_v$ a new edge $(pred(w), v)$, thus linking the rest of the queue to $v$.
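
The partition computed in round R2 can be obtained greedily. The Python sketch below is one way to do it under an extra assumption that is NOT made in the text: every announced count is at most $M/4$, which is what lets the greedy rule keep each closed group at or below $M/2$. The function name is hypothetical, and index 0 stands for $n_{head}$.

```python
def partition_counts(counts, M):
    """Greedily split indices 0..k into groups S_1, ..., S_m.

    counts[0] plays the role of n_head; counts[j] is the number of items
    announced by in-neighbor u_j. Assumption of this sketch: every count
    is at most M/4. Every group except possibly the last then sums to a
    value in [M/4, M/2], and the last group sums to at most M/2.
    """
    if any(4 * c > M for c in counts):
        raise ValueError("this sketch assumes each count is at most M/4")
    groups, current, total = [], [], 0
    for j, c in enumerate(counts):
        current.append(j)
        total += c
        if 4 * total >= M:      # reached M/4; the sum is still below M/2
            groups.append(current)
            current, total = [], 0
    if current or not groups:   # leftover indices form the last group
        groups.append(current)
    return groups
```

Since index 0 is scanned first, it always lands in the first group, matching the convention $0 \in S_1$.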

Theorem 4.2: Consider a modified MapReduce framework, where in every round each node is required
to send at most $M$ items, but is allowed to keep and receive an unlimited number of items as long as they
arrive from at most $M$ different nodes, with excess items stored in a FIFO input buffer and fed into function
$f$ in blocks of size at most $M$. Let A be an algorithm in this modified MapReduce framework with $R$ round
complexity and $C$ communication complexity. Then we can implement A in the original I/O-memory-bound
MapReduce framework in $O(R)$ rounds and $O(C)$ communication complexity.

Proof: First, it is easy to see that our simulation ensures that each node keeps as well as sends and receives
at most $M$ items. Next, note that in every three rounds (rounds $3t$, $3t+1$, $3t+2$), each node $v \in V$ routes
$\min\{O(M), k^*\}$ items, where $k^*$ is the combined number of items in the queue and the number of items
that $v$'s in-neighbors send to $v$ during the three rounds. This is within a constant factor of the number of
items that $v$ routes in round $t$ in algorithm A. Finally, the only additional items we send in each round are
the edges of the queues $\{L_v \mid v \in V\}$. Note that we only need to maintain $O(1)$ additional edges for each
node of each $L_v$. And since these nodes are non-empty, the additional edges do not contribute more than a
constant factor to the communication complexity. ■



Applications. The DAG $G$ of the multi-search BSP algorithm [10] satisfies the requirement that at most
$M$ nodes attempt to send items to any other node. In addition, if some processor of the BSP algorithm
happens to keep more than $M$ items, the processing of these items is delayed and can be done in any
order, including FIFO. Thus, the requirements of Theorem 4.2 are satisfied.

We do not know how to modify our random indexing algorithm in Section 2.1 to fit the modified
framework. Thus, we cannot provide a Las Vegas algorithm. However, the above framework reduces the
probability of failure from $N^{-\Omega(1)}$ to the probability of failure of the random indexing step, i.e., $N^{-\Omega(M)}$, which
is much smaller for large values of $M$.

The modified framework might be of independent interest because it allows for an alternative way of 
designing algorithms for MapReduce. In particular, it removes the burden of keeping track of the number of 
items kept or sent by a node. 

4.3 Sorting 

In this section, we show how to obtain a simple sorting algorithm in the MapReduce model by using our
multi-search algorithm. First, it is easy to obtain the following brute-force sorting result, which is proved in
Appendix A.



Lemma 4.3: Given a set $X$ of $N$ indexed comparable items, we can sort them in $\Theta(\log_M N)$ rounds and
$O(N^2 \log_M N)$ communication complexity in the MapReduce model.

Combining the brute-force sorting algorithm with the multi-searching algorithm in the previous section, we 
present here a simple sorting algorithm with optimal round and communication complexities. 



1. Pick $O(\sqrt{N})$ random pivots. Sort the pivots using the brute-force sorting algorithm. This results in the
pivots being assigned a unique index/label in the range $[1, \sqrt{N}]$.

2. Build a search tree on the set of pivots as the leaves of the tree. 

3. Perform a multi-search on the input items over the search tree. The result is a label associated with
each item, which identifies the "bucket" into which the item is partitioned.

4. Recursively sort each bucket in parallel. 
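
The four steps above can be sketched sequentially in Python. This is only a single-machine stand-in, not the MapReduce algorithm: Python's built-in `sorted` replaces the brute-force sort of the pivots, a binary search with `bisect` replaces the multi-search over the pivot tree, and the base-case threshold and the guard against buckets that do not shrink (which can happen with many duplicate items) are assumptions added to make the sketch terminate.

```python
import bisect
import random


def bucket_sort_sketch(items, rng=None, base_case=16):
    """Sequential stand-in for: pick pivots, build tree, multi-search, recurse."""
    if rng is None:
        rng = random.Random(0)
    n = len(items)
    if n <= base_case:
        return sorted(items)                        # small input: sort directly
    num_pivots = max(1, int(n ** 0.5))
    # Step 1: about sqrt(n) random pivots, sorted (stands in for brute-force sort).
    pivots = sorted(rng.sample(items, num_pivots))
    # Steps 2-3: one binary search per item plays the role of its search path.
    buckets = [[] for _ in range(num_pivots + 1)]
    for x in items:
        buckets[bisect.bisect_left(pivots, x)].append(x)
    # Step 4: recurse on every bucket; fall back if a bucket did not shrink.
    out = []
    for b in buckets:
        out.extend(sorted(b) if len(b) == n else bucket_sort_sketch(b, rng, base_case))
    return out
```

Bucket $i$ receives exactly the items lying strictly above pivot $i-1$ and at or below pivot $i$, so concatenating the sorted buckets in index order yields the sorted output.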



Combined with Lemma 4.3, it is easy to see that this sorting algorithm runs in $O(\log_M N)$ rounds and has
$O(N \log_M N)$ communication complexity with high probability.

A Brute-Force Multi-search and Sorting



In this section, we first present a brute force multi-search algorithm, and then use it to design a brute force 
sorting algorithm. 

The input for multi-search is a set $X = \{x_1, \ldots, x_n\}$ of $n$ query items and a sorted set $Y = \{y_1, \ldots, y_m\}$
of $m$ items corresponding to the leaves of the search tree. The goal is for each $x_i$ ($1 \le i \le n$) to find the
leaf $y_j$ such that the search path for $x_i$ will terminate at $y_j$. Moreover, for each leaf $y_j$ ($j = 1, \ldots, m$), we want
to compute the number of items in $X$ whose search paths will terminate at $y_j$. The input for sorting is a set
$X = \{x_1, \ldots, x_n\}$ of $n$ items. The goal is to sort the $n$ items. We assume that the set $X$ is indexed; otherwise
we can first perform the random indexing by Lemma 2.3. Note that the set $Y$ is sorted and thus indexed by
default.

Brute-Force Multi-search. At the beginning of the algorithm, let nodes $\{p_1, \ldots, p_n\}$ be the input nodes
containing input items $x_1, \ldots, x_n$, respectively, and let nodes $\{q_1, \ldots, q_m\}$ be the input nodes containing input
items $y_1, \ldots, y_m$, respectively. The input items are always kept during the computation.

1. Generate all pairs: First, $p_i$ sends $x_i$ to node $v_{i,1}$ for all $i = 1, \ldots, n$, and $q_j$ sends $y_j$ to node $v_{1,j}$ for
all $j = 1, \ldots, m$. Next, for $l = 1$ to $\log_M m$ do:

• For each $i \in [n]$, for each node $v_{i,j}$ containing an item $x_i$, it keeps $x_i$ and sends a copy of $x_i$ to
nodes $v_{i,j'_1}, \ldots, v_{i,j'_M}$, where $j'_k = (j-1) \cdot M + k$ for $1 \le k \le M$.

Similarly, for $l = 1$ to $\log_M n$ do:

• For each $j \in [m]$, for each node $v_{i,j}$ containing an item $y_j$, it keeps $y_j$ and sends a copy of $y_j$ to
nodes $v_{i'_1,j}, \ldots, v_{i'_M,j}$, where $i'_k = (i-1) \cdot M + k$ for $1 \le k \le M$.

2. Compare each pair of items: Each node $v_{i,j}$ compares its items $x_i$ and $y_j$. If $x_i < y_j$, $v_{i,j}$ generates
$(x_i, 0)$ and keeps it; otherwise it generates $(x_i, 1)$ and keeps it.

3. Add up values: For each $i \in [n]$, let $(x_i, b_j)$ ($b_j \in \{0, 1\}$) be the item stored at node $v_{i,j}$ for
$j = 1, \ldots, m$. We compute $k_i = \sum_{j=1}^{m} b_j$ the same way as the bottom-up phase of computing the prefix
sums in Section 2.1. Then $y_{k_i}$ is the leaf node in $Y$ where the search path of $x_i$ ends.

Similarly, for each $j \in [m]$, let $(x_i, b_i)$ ($b_i \in \{0, 1\}$) be the item stored at node $v_{i,j}$ for $i = 1, \ldots, n$.
We compute $c_j = \sum_{i=1}^{n} b_i$, which is the number of query items in $X$ whose search paths end at $y_j$.
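
Steps 1 to 3 amount to filling an $n \times m$ grid of comparison bits and summing along rows and columns. The Python sketch below computes the same quantities sequentially; it does not model the $\log_M$ rounds of copying or the tree-based summation, and the function name is hypothetical.

```python
def brute_force_multisearch(X, Y):
    """All-pairs comparisons for queries X against sorted leaves Y.

    bits[i][j] is 0 if X[i] < Y[j] and 1 otherwise (step 2), one entry per
    node v_{i,j}. Row sums give k_i, the number of leaves that are <= X[i];
    column sums are the per-leaf totals of the same bits.
    """
    bits = [[0 if x < y else 1 for y in Y] for x in X]
    row_sums = [sum(row) for row in bits]
    col_sums = [sum(bits[i][j] for i in range(len(X))) for j in range(len(Y))]
    return row_sums, col_sums
```

For $X = (5, 1, 9)$ and $Y = (2, 4, 6, 8)$ the row sums are $(2, 0, 4)$: query 5 stops at the second leaf, query 1 lies before every leaf, and query 9 stops at the last leaf. The grid has $nm$ cells, which is the source of the quadratic communication cost in Lemma 4.3.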

Brute-Force Sorting. Brute-force sorting can be done with the brute-force multi-search algorithm.
We create a copy of $X$ and treat it as the set $Y$. Then we run the algorithm for multi-search. The $k_i$
computed for each $x_i$ is the rank of item $x_i$.
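
Once every item knows its rank, sorting reduces to placing each item at the position given by that rank. A minimal Python sketch, assuming the items are distinct (the tie-breaking that indexing would provide is not modeled) and using a hypothetical function name:

```python
def brute_force_sort(X):
    """Sort distinct items by ranks obtained from all-pairs comparisons."""
    # k_i = number of items y in X with y <= x_i, i.e. the 1-based rank of x_i.
    ranks = [sum(0 if x < y else 1 for y in X) for x in X]
    out = [None] * len(X)
    for x, r in zip(X, ranks):
        out[r - 1] = x          # ranks are unique because items are distinct
    return out
```

Each item is compared with every item, itself included, so the smallest item gets rank 1 and the largest gets rank $n$.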